Cross-Speaker Emotion Disentangling and Transfer for End-to-End Speech Synthesis
Authors
Abstract
The cross-speaker emotion transfer task in text-to-speech (TTS) synthesis aims to synthesize speech for a target speaker with the emotion transferred from reference speech recorded by another (source) speaker. During the transfer process, the identity information of the source speaker could also affect the synthesized results, causing the speaker-leakage issue, i.e., the synthetic speech may have the voice of the source speaker rather than the target speaker. This paper proposes a new method that aims to synthesize controllable emotional expressive speech while maintaining the target speaker's identity in the cross-speaker emotion TTS task. The proposed method is a Tacotron2-based framework that uses an emotion embedding as the conditioning variable to provide emotion information. Two disentangling modules are contained in our model: 1) one obtains a speaker-irrelevant and emotion-discriminative embedding, and 2) the other explicitly constrains the emotion and speaker identity of the synthetic speech to be those expected. Moreover, we present an intuitive way to control the emotion strength of the synthetic speech. Specifically, the learned emotion embedding is adjusted with a flexible scalar value, which allows controlling the amount of emotion conveyed by the embedding. Extensive experiments have been conducted on a Mandarin disjoint corpus, and the results demonstrate that the proposed method is able to synthesize reasonable emotional speech for the target speaker. Compared with state-of-the-art reference-based methods, the proposed method gets the best performance on the cross-speaker emotion transfer task, indicating that it achieves better disentangled emotion-embedding learning. Furthermore, an emotion-strength ranking test and pitch-trajectory plots show that the proposed method can effectively control the emotion strength, leading to prosody-diverse synthetic speech.
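The strength-control idea described above, scaling a learned emotion embedding by a scalar before using it as a conditioning variable, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, shapes, and the concatenation-style conditioning are assumptions.

```python
import numpy as np

def condition_with_emotion(text_encoding, emotion_embedding, strength=1.0):
    """Scale the learned emotion embedding by a scalar strength value and
    attach it to every text-encoder frame as a conditioning signal.
    Shapes and the concatenation scheme are illustrative assumptions."""
    scaled = strength * emotion_embedding                   # (emb_dim,)
    tiled = np.tile(scaled, (text_encoding.shape[0], 1))    # (T, emb_dim)
    return np.concatenate([text_encoding, tiled], axis=1)   # (T, enc_dim + emb_dim)

# strength = 0.0 suppresses the emotion cue; values > 1.0 exaggerate it.
enc = np.zeros((5, 8))   # 5 text frames, hypothetical encoder dim 8
emo = np.ones(4)         # hypothetical 4-dim emotion embedding
out = condition_with_emotion(enc, emo, strength=0.5)
```

In this sketch the decoder would see the scaled embedding at every timestep, so varying `strength` smoothly interpolates between neutral and fully emotional conditioning.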
Similar resources
End-to-End Neural Speech Synthesis
In recent years, end-to-end neural networks have become the state of the art for speech recognition tasks and they are now widely deployed in industry (Amodei et al., 2016). Naturally, this has led to the creation of systems to do the opposite – end-to-end speech synthesis from raw text. Very recently, neural TTS systems have become highly competitive with their conventional counterparts, showi...
Tacotron: Towards End-to-End Speech Synthesis
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. G...
Char2wav: End-to-end Speech Synthesis
We present Char2Wav, an end-to-end model for speech synthesis. Char2Wav has two components: a reader and a neural vocoder. The reader is an encoderdecoder model with attention. The encoder is a bidirectional recurrent neural network that accepts text or phonemes as inputs, while the decoder is a recurrent neural network (RNN) with attention that produces vocoder acoustic features. Neural vocode...
Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron
We present an extension to the Tacotron speech synthesis architecture that learns a latent embedding space of prosody, derived from a reference acoustic representation containing the desired prosody. We show that conditioning Tacotron on this learned embedding space results in synthesized audio that matches the prosody of the reference signal with fine time detail even when the reference and sy...
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
In this work, we propose “global style tokens” (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-toend speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable “labels” they generate can be used to con...
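The GST mechanism summarized above, a reference encoding attending over a bank of jointly trained token embeddings to produce soft interpretable weights, can be sketched as follows. This is a simplified illustration under assumed shapes, not the exact Tacotron/GST architecture (which uses multi-head attention).

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def global_style_embedding(ref_encoding, style_tokens):
    """Attend over a bank of style tokens with the reference encoding as
    the query; the style embedding is the attention-weighted sum of the
    tokens. A dot-product single-head sketch; names/shapes are assumptions."""
    scores = style_tokens @ ref_encoding   # (num_tokens,)
    weights = softmax(scores)              # the soft interpretable "labels"
    return weights @ style_tokens          # (token_dim,)

tokens = np.eye(3)                      # toy bank: 3 tokens of dim 3
ref = np.array([2.0, 0.0, 0.0])         # toy reference encoding
emb = global_style_embedding(ref, tokens)
```

At inference time the weights can also be set by hand instead of computed from a reference, which is what makes the tokens usable as style controls.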
Journal
Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Year: 2022
ISSN: 2329-9304, 2329-9290
DOI: https://doi.org/10.1109/taslp.2022.3164181